Comparison of Clustering Methods over a Hidden Web Data using Stratification

Authors

  • G. Jaya Suma
  • R. Manjula
Abstract

This paper addresses data mining in general, and clustering in particular, on hidden web data. Data mining is the process of analyzing large amounts of data and extracting knowledge that is useful to users. Hidden (or deep) web data resides in a database on a remote system, so it can be accessed only through a query interface such as an HTML form. Clustering such data is difficult because access is indirect, limited to what the query interface supports, and slow. The paper introduces a stratified clustering methodology based on sampling of the dataset. Because samples can be obtained only by submitting queries, an efficient sampling method is needed to reduce both the time and the number of queries required to access deep web data. The proposed approach consists of three steps: (1) the space of input attributes is partitioned into strata that capture the association between input and output attributes; (2) an efficient sampling method is applied to obtain high estimation accuracy; and (3) the resulting samples are clustered by two methods, stratified k-means clustering and hierarchical clustering. The estimation accuracy of the cluster centers of the deep web data is then compared for simple random sampling versus stratified sampling, and for k-means clustering versus hierarchical clustering.
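The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a toy in-memory stand-in for the hidden database, a single categorical input attribute whose values define the strata, known stratum sizes, proportional allocation of the query budget, and a plain one-dimensional k-means in place of the paper's stratified k-means. All function names and the data specification are hypothetical, and hierarchical clustering is not covered.

```python
import random

def make_hidden_source(seed=0):
    """Toy stand-in for a deep web database (hypothetical data).
    Keys are values of the input attribute; lists hold the output attribute."""
    rng = random.Random(seed)
    # Assumed strata: (mean, std dev, number of records) per input value.
    spec = {"A": (10.0, 1.0, 600), "B": (50.0, 2.0, 300), "C": (90.0, 3.0, 100)}
    return {key: [rng.gauss(mu, sigma) for _ in range(n)]
            for key, (mu, sigma, n) in spec.items()}

def simple_random_sample(source, budget, rng):
    """Draw records uniformly from the whole population, ignoring strata."""
    population = [x for records in source.values() for x in records]
    return rng.sample(population, budget)

def stratified_sample(source, budget, rng):
    """Proportional allocation: each stratum receives a share of the budget
    matching its share of the population (at least one record per stratum).
    Rounding means the total can differ slightly from the budget in general."""
    total = sum(len(records) for records in source.values())
    sample = []
    for records in source.values():
        n = max(1, round(budget * len(records) / total))
        sample.extend(rng.sample(records, min(n, len(records))))
    return sample

def kmeans_1d(values, k, rng, iterations=20):
    """Plain Lloyd's k-means on scalar values; returns sorted cluster centers."""
    centers = rng.sample(values, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the previous center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

if __name__ == "__main__":
    source = make_hidden_source()
    rng = random.Random(1)
    full = [x for records in source.values() for x in records]
    print("full data   :", kmeans_1d(full, 3, rng))
    print("simple rand.:", kmeans_1d(simple_random_sample(source, 50, rng), 3, rng))
    print("stratified  :", kmeans_1d(stratified_sample(source, 50, rng), 3, rng))
```

Comparing the centers estimated from each sample against those computed on the full data gives a rough sense of the estimation accuracy the abstract refers to. In a real deep web setting the stratum sizes would not be known in advance and each sampled record would cost a query, which this sketch does not model.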

Similar resources

Stratification Based Hierarchical Clustering Over a Deep Web Data Source

This paper focuses on the problem of clustering data from a hidden or deep web data source. A key characteristic of deep web data sources is that data can only be accessed through the limited query interface they support. Because the underlying data set cannot be accessed directly, data mining must be performed based on sampling of the datasets. The samples, in turn, can only be obtained by ...

Outlier Detection Using Extreme Learning Machines Based on Quantum Fuzzy C-Means

One of the most important concerns of a data miner is always to have accurate and error-free data: data that does not contain human errors and whose records are complete and correct. In this paper, a new learning model based on an extreme learning machine neural network is proposed for outlier detection. The function of neural networks depends on various parameters such as the structur...

Coronavirus: Discover the Structure of Global Knowledge, Hidden Patterns & Emerging Events

Background & Objective: The present study aimed at exploring the structure of global knowledge, hidden patterns, and emerging Coronavirus events using co-word techniques. Co-word analysis is one of the most efficient scientific methods to analyze the structure and dynamics of knowledge and the general state of research. Materials & Methods: This applied research was performed using Co-word anal...

The ensemble clustering with maximize diversity using evolutionary optimization algorithms

Data clustering is one of the main steps in data mining, which is responsible for exploring hidden patterns in non-tagged data. Due to the complexity of the problem and the weakness of the basic clustering methods, most studies today are guided by clustering ensemble methods. Diversity in primary results is one of the most important factors that can affect the quality of the final results. Also...

A density based clustering approach to distinguish between web robot and human requests to a web server

Today, the world's dependence on the Internet and the emergence of Web 2.0 applications are significantly increasing the need for web robots that crawl sites to support services and technologies. Despite their advantages, robots may consume bandwidth and reduce the performance of web servers. Despite a variety of research, there is no accurate method for classifying huge data ...

Publication date: 2013